Local equivalences of distances between clusterings

Author

  • Marina Meilă
Abstract

In comparing clusterings, several different distances and indices are in use. We prove that the Misclassification Error distance, the Hamming distance (equivalent to the unadjusted Rand index), and the dχ² distance between partitions are equivalent in the neighborhood of 0. In other words, if two partitions are very similar, then one distance defines upper and lower bounds on the other and vice versa. The proof is geometric and relies on the convexity of a certain set of probability measures. To my knowledge, this is the first result of its kind. The motivation for this work is in the area of data clustering. Practically, these distances are frequently used to compare two clusterings of a set of observations. Theoretically, such distances are involved in formulating and proving properties of clustering algorithms. Besides, our results apply to any pair of finite-valued random variables, and provide simple yet tight upper and lower bounds on the χ² measure of (in)dependence, valid when the two variables are strongly dependent.
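To make two of the distances named in the abstract concrete, here is a minimal Python sketch. It is not taken from the paper: the function names are hypothetical, the Misclassification Error is computed by brute-force search over label matchings (practical only for a handful of clusters), and the Hamming distance is normalized as the fraction of point pairs on which the two clusterings disagree, i.e. one minus the unadjusted Rand index. The paper may use a different normalization.

```python
from collections import Counter
from itertools import combinations, permutations


def misclassification_error(a, b):
    """1 - (largest fraction of points that agree under a one-to-one
    matching of cluster labels). Brute force; assumes few clusters."""
    n = len(a)
    labels_a = list(dict.fromkeys(a))
    labels_b = list(dict.fromkeys(b))
    # Match the clustering with fewer clusters into the one with more.
    if len(labels_a) > len(labels_b):
        a, b = b, a
        labels_a, labels_b = labels_b, labels_a
    counts = Counter(zip(a, b))  # contingency table as a sparse counter
    best = 0
    for perm in permutations(labels_b, len(labels_a)):
        matched = sum(counts[(la, lb)] for la, lb in zip(labels_a, perm))
        best = max(best, matched)
    return 1 - best / n


def hamming_distance(a, b):
    """Fraction of point pairs that are together in one clustering and
    apart in the other (assumed normalization: 1 - Rand index)."""
    n = len(a)
    disagree = sum(
        (a[i] == a[j]) != (b[i] == b[j])
        for i, j in combinations(range(n), 2)
    )
    return disagree / (n * (n - 1) / 2)


# Two clusterings of six points that differ in one point's assignment.
c1 = [0, 0, 0, 1, 1, 1]
c2 = [0, 0, 1, 1, 1, 1]
me = misclassification_error(c1, c2)
ham = hamming_distance(c1, c2)
```

Both functions return 0 for clusterings that are identical up to a renaming of labels, which is the neighborhood in which the abstract states the distances are equivalent.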

Similar resources

MultiDendrograms: Variable-Group Agglomerative Hierarchical Clusterings

MultiDendrograms is a Java-written application that computes agglomerative hierarchical clusterings of data. Starting from a distances (or weights) matrix, MultiDendrograms is able to calculate its dendrograms using the most common agglomerative hierarchical clustering methods. The application implements a variable-group algorithm that solves the non-uniqueness problem found in the standard pai...

Equivalences in Bicategories

In this paper, we establish some connections between the concept of an equivalence of categories and that of an equivalence in a bicategory. Its main result builds upon the observation that two closely related concepts, which could both play the role of an equivalence in a bicategory, turn out not to coincide. Two counterexamples are provided for that goal, and detailed proofs are given. In par...

Annotation-based Distance Measures for Patient Subgroup Discovery in Clinical Microarray Studies

MOTIVATION Clustering algorithms are widely used in the analysis of microarray data. In clinical studies, they are often applied to find groups of co-regulated genes. Clustering, however, can also stratify patients by similarity of their gene expression profiles, thereby defining novel disease entities based on molecular characteristics. Several distance-based cluster algorithms have been sugge...

From Comparing Clusterings to Combining Clusterings

This paper presents a fast simulated annealing framework for combining multiple clusterings (i.e. clustering ensemble) based on some measures of agreement between partitions, which are originally used to compare two clusterings (the obtained clustering vs. a ground truth clustering) for the evaluation of a clustering algorithm. Though we can follow a greedy strategy to optimize these measures a...

Weighted Ensemble Clustering for Increasing the Accuracy of the Final Clustering

Clustering algorithms are highly dependent on different factors such as the number of clusters, the specific clustering algorithm, and the used distance measure. Inspired from ensemble classification, one approach to reduce the effect of these factors on the final clustering is ensemble clustering. Since weighting the base classifiers has been a successful idea in ensemble classification, in th...

Publication date: 2009